After having completed this chapter you will be able to:
Perform dimensionality reduction (PCA, UMAP) on single-cell data
Integrate multiple samples to remove batch effects
Cluster cells and evaluate clustering results
Annotate cell types using marker genes and automated methods
Visualize and interpret single-cell analysis results
Dimensionality Reduction
Load the scData dataset you have created in Part 1 and load the :
library(Seurat)
Loading required package: SeuratObject
Loading required package: sp
'SeuratObject' was built under R 4.5.0 but the current version is
4.5.1; it is recomended that you reinstall 'SeuratObject' as the ABI
for R may have changed
Attaching package: 'SeuratObject'
The following objects are masked from 'package:base':
intersect, t
Once the data is normalized, scaled and variable features have been identified, we can start to reduce the dimensionality of the data. For the PCA, by default, only the previously determined variable features are used as input, but can be defined using features argument if you wish to specify a vector of genes. The PCA will only be run on the variable features, that you can check with VariableFeatures(scData).
scData <- Seurat::RunPCA(scData)
To view the PCA plot:
Seurat::DimPlot(scData, reduction ="pca")
We can colour the PCA plot according to any factor that is present in @meta.data, or for any gene. For example we can take the column percent.globin:
Seurat::FeaturePlot(scData, reduction ="pca", features ="percent.mito")
Warning: The `slot` argument of `FetchData()` is deprecated as of SeuratObject 5.0.0.
ℹ Please use the `layer` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
Please report the issue at <https://github.com/satijalab/seurat/issues>.
Note that we used a different plotting function here: FeaturePlot. The difference between DimPlot and FeaturePlot is that the first allows you to color the points in the plot according to a grouping variable (e.g. sample) while the latter allows you to color the points according to a continuous variable (e.g. gene expression).
Generate a PCA plot where color is according to counts of a gene (i.e. gene expression). For example, you can take HBA1 (alpha subunit of hemoglobin), or one of the most variable genes (e.g. IFIT2):
Seurat::FeaturePlot(scData, reduction ="pca", features ="IFIT2")
We can generate heatmaps according to their principal component scores calculated in the rotation matrix:
The elbowplot can help you in determining how many PCs to use for downstream analysis such as UMAP:
Seurat::ElbowPlot(scData, ndims =40) +geom_vline(xintercept =10, linetype ="dashed", color ="red")
The elbow plot ranks principle components based on the percentage of variance explained by each one. Where we observe an “elbow” or flattening curve, the majority of true signal is captured by this number of PCs, eg around 25 PCs for the scData dataset.
Including too many PCs usually does not affect much the result, while including too few PCs can affect the results very much.
UMAP: The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space.
scData <- Seurat::RunUMAP(scData, dims =1:10)
To view the UMAP plot:
Seurat::DimPlot(scData, reduction ="umap")
Try changing the color of the dots in the UMAP according to a variable (e.g. percent.globin or HBA1):
Seurat::FeaturePlot(scData, features =c("IFIT2","percent.globin","percent.mito"))
You can experiment with UMAP parameters such as the number of neighbors:
# Try different number of neighbors (default is 30)scData <- Seurat::RunUMAP(scData, dims =1:10, n.neighbors =50)
Seurat::DimPlot(scData, reduction ="umap")
Reset to standard parameters for consistency:
scData <- Seurat::RunUMAP(scData, dims =1:10)
Integration
Let’s have a look at the UMAP again. Although cells of different samples are shared amongst ‘clusters’, you can still see separation within the clusters:
Seurat::DimPlot(scData, reduction ="umap")
To perform the integration, we split our object by sample, resulting into a set of layers within the RNA assay. The layers are integrated and stored in the reduction slot - in our case we call it integrated.cca. Then, we re-join the layers
scData[["RNA"]] <-split(scData[["RNA"]], f = scData$orig.ident)scData <- Seurat::IntegrateLayers(object = scData, method = CCAIntegration,orig.reduction ="pca",new.reduction ="integrated.cca",verbose =FALSE)# re-join layers after integrationscData[["RNA"]] <-JoinLayers(scData[["RNA"]])
We can then use this new integrated matrix for clustering and visualization. Now, we can re-run and visualize the results with UMAP.
Create the UMAP again on the integrated.cca reduction:
The method implemented in Seurat first constructs a SNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors() function, and takes as input the previously defined dimensionality of the dataset.
To cluster the cells, Seurat next implements modularity optimization techniques such as the Louvain algorithm (default) or SLM, to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters.
Based on resolution of 0.4, the tree stays relatively stable for a few resolution steps, and it seems that clustering fits the UMAP well. Let’s set this as our default clustering:
scData <- Seurat::SetIdent(scData, value = scData$RNA_snn_res.0.4)
Cell Annotation
Load the following packages:
library(celldex)library(SingleR)
During cell annotation we will use the original count data (not the integrated data):
DefaultAssay(scData) <-"RNA"
Based on the UMAP we have generated, we can visualize expression for a gene in each cluster:
Seurat::FeaturePlot(scData, "HBA1")
Based on expression of sets of genes you can do a manual cell type annotation. If you know the marker genes for some cell types, you can check whether they are up-regulated in one or the other cluster. Here we have some marker genes for different cell types:
Warning: `PackageCheck()` was deprecated in SeuratObject 5.0.0.
ℹ Please use `rlang::check_installed()` instead.
ℹ The deprecated feature was likely used in the Seurat package.
Please report the issue at <https://github.com/satijalab/seurat/issues>.
Have a look at the B cell and monocyte genes as well:
After running AddModuleScore, a column was added to scData@meta.data called tcell_genes1. It contains the module score for each cell (which is basically the average expression of the set of genes).
You can plot the UMAP with
Seurat::FeaturePlot(scData, "tcell_genes1")
Which shows these genes are mainly expressed in certain clusters:
Seurat::VlnPlot(scData,"tcell_genes1")
Annotating cells according to cycling phase
Based on the same principle, we can also annotate cell cycling state. The function CellCycleScore uses AddModuleScore to get a score for the G2/M and S phase (the mitotic phases in which cell is cycling). In addition, CellCycleScore assigns each cell to either the G2/M, S or G1 phase.
First we extract the built-in genes for cell cycling:
To do a fully automated annotation, we need a reference dataset of primary cells. Here we are using a hematopoietic reference dataset from celldex. Check out what’s in there:
DataFrame with 6 rows and 4 columns
scores labels
<matrix> <character>
Donor1_AAACCCTGTGACGAGT-1 0.248058:0.138436:0.359456:... CD4+ T cells
Donor1_AAACGAATCAGGCTAC-1 0.330869:0.203598:0.462598:... CD4+ T cells
Donor1_AAACGACAGATTGACT-1 0.310532:0.300673:0.270932:... Monocytes
Donor1_AAACGATGTCTTGAAC-1 0.307333:0.194588:0.431520:... CD4+ T cells
Donor1_AAACGATGTGCGCGAA-1 0.497694:0.363757:0.337637:... B cells
Donor1_AAACGATGTTAGCCCA-1 0.270457:0.173731:0.375372:... CD4+ T cells
delta.next pruned.labels
<numeric> <character>
Donor1_AAACCCTGTGACGAGT-1 0.1484061 CD4+ T cells
Donor1_AAACGAATCAGGCTAC-1 0.0822642 CD4+ T cells
Donor1_AAACGACAGATTGACT-1 0.0647324 Monocytes
Donor1_AAACGATGTCTTGAAC-1 0.1204955 CD4+ T cells
Donor1_AAACGATGTGCGCGAA-1 0.1339370 B cells
Donor1_AAACGATGTTAGCCCA-1 0.0897359 CD4+ T cells
Visualize singleR score quality scores:
SingleR::plotScoreHeatmap(scData_SingleR)
SingleR::plotDeltaDistribution(scData_SingleR)
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Warning in max(data$density, na.rm = TRUE): no non-missing arguments to max;
returning -Inf
Warning: Computation failed in `stat_ydensity()`.
Caused by error in `$<-.data.frame`:
! replacement has 1 row, data has 0
There are some annotations that contain only a few cells. They are usually not of interest, and they clogg our plots. Therefore we remove them from the annotation:
We can check out how many cells per sample we have for each annotated cell type:
dittoSeq::dittoBarPlot(scData, var ="SingleR_annot", group.by ="orig.ident")
Compare our manual annotation (based on the set of T cell genes) to the annotation with SingleR. We can have a look at the mean module score for each SingleR annotation like this:
dittoSeq::dittoBarPlot(scData, var ="SingleR_annot", group.by ="RNA_snn_res.0.4")